35 research outputs found

    Improving the effective use of multithreaded architectures : implications on compilation, thread assignment, and timing analysis

    Get PDF
    This thesis presents cross-domain approaches that improve the effective use of multithreaded architectures. The contributions of the thesis can be classified in three groups. First, we propose several methods for thread assignment of network applications running in multithreaded network servers. Second, we analyze the problem of graph partitioning that is a part of the compilation process of multithreaded streaming applications. Finally, we present a method that improves the measurement-based timing analysis of multithreaded architectures used in time-critical environments. The following sections summarize each of the contributions. (1) Thread assignment on multithreaded processors: State-of-the-art multithreaded processors have different level of resource sharing (e.g. between thread running on the same core and globally shared resources). Thus, the way that threads of a given workload are assigned to processors' hardware contexts determines which resources the threads share, which, in turn, may significantly affect the system performance. In this thesis, we demonstrate the importance of thread assignment for network applications running in multithreaded servers. We also present TSBSched and BlackBox scheduler, methods for thread assignment of multithreaded network applications running on processors with several levels of resource sharing. Finally, we propose a statistical approach to the thread assignment problem. In particular, we show that running a sample of several hundred or several thousand random thread assignments is sufficient to capture at least one out of 1% of the best-performing assignments with a very high probability. We also describe the method that estimates the optimal system performance for given workload. We successfull y applied TSBSched, BlackBox scheduler, and the presented statistical approach to a case study of thread assignment of multithreaded network applications running on the UltraSPARC T2 processor. (2) Kernel partitioning of streaming applications: An important step in compiling a stream program to multiple processors is kernel partitioning. Finding an optimal kernel partition is, however, an intractable problem. We propose a statistical approach to the kernel partitioning problem. We describe a method that statistically estimates the performance of the optimal kernel partition. We demonstrate that the sampling method is an important part of the analysis, and that not all methods that generate random samples provide good results. We also show that random sampling on its own can be used to find a good kernel partition, and that it could be an alternative to heuristics-based approaches. The presented statistical method is applied successfully to the benchmarks included in the StreamIt 2.1.1 suite. (3) Multithreaded processors in time-critical environments: Despite the benefits that multithreaded commercial-of-the-shelf (MT COTS) processors may offer in embedded real-time systems, the time-critical market has not yet embraced a shift toward these architectures. The main challenge with MT COTS architectures is the difficulty when predicting the execution time of concurrently-running (co-running) time-critical tasks. Providing a timing analysis for real industrial applications running on MT COTS processors becomes extremely difficult because the execution time of a task, and hence its worst-case execution time (WCET) depends on the interference with co-running tasks in shared processor resources. We show that the measurement-based timing analysis used for single-threaded processors cannot be directly extended for MT COTS architectures. Also, we propose a methodology that quantifies the slowdown that a task may experience because of collision with co-running tasks in shared resources of MT COTS processor. The methodology is applied to a case study in which different time-critical applications were executed on several MT COTS multithreaded processors

    Main memory in HPC: do we need more or could we live with less?

    Get PDF
    This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High Performance Conjugate Gradients benchmark could be an important success story for 3D-stacked memories in HPC, but High-performance Linpack is likely to be constrained by 3D memory capacity

    PROFET: modeling system performance and energy without simulating the CPU

    Get PDF
    The approaching end of DRAM scaling and expansion of emerging memory technologies is motivating a lot of research in future memory systems. Novel memory systems are typically explored by hardware simulators that are slow and often have a simplified or obsolete abstraction of the CPU. This study presents PROFET, an analytical model that predicts how an application's performance and energy consumption changes when it is executed on different memory systems. The model is based on instrumentation of an application execution on actual hardware, so it already takes into account CPU microarchitectural details such as the data prefetcher and out-of-order engine. PROFET is evaluated on two real platforms: Sandy Bridge-EP E5-2670 and Knights Landing Xeon Phi platforms with various memory configurations. The evaluation results show that PROFET's predictions are accurate, typically with only 2% difference from the values measured on actual hardware. We release the PROFET source code and all input data required for memory system and application profiling. The released package can be seamlessly installed and used on high-end Intel platforms.Peer ReviewedPostprint (author's final draft

    HPC benchmarking: scaling right and looking beyond the average

    Get PDF
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-96983-1_10Designing a balanced HPC system requires an understanding of the dominant performance bottlenecks. There is as yet no well established methodology for a unified evaluation of HPC systems and workloads that quantifies the main performance bottlenecks. In this paper, we execute seven production HPC applications on a production HPC platform, and analyse the key performance bottlenecks: FLOPS performance and memory bandwidth congestion, and the implications on scaling out. We show that the results depend significantly on the number of execution processes and granularity of measurements. We therefore advocate for guidance in the application suites, on selecting the representative scale of the experiments. Also, we propose that the FLOPS performance and memory bandwidth should be represented in terms of the proportions of time with low, moderate and severe utilization. We show that this gives much more precise and actionable evidence than the average.This work was supported by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), Severo Ochoa Programme (SEV-2015-0493) of the Spanish Government; and the European Union's Horizon 2020 research and innovation programme under ExaNoDe project (grant agreement No 671578).Peer ReviewedPostprint (author's final draft

    Main memory in HPC: do we need more, or could we live with less?

    Get PDF
    An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanovi´c from High Performance Computing Center Stuttgart for their technical support.Postprint (published version

    Performance impact of a slower main memory: a case study of STT-MRAM in HPC

    Get PDF
    In high-performance computing (HPC), significant effort is invested in research and development of novel memory technologies. One of them is Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) --- byte-addressable, high-endurance non-volatile memory with slightly higher access time than DRAM. In this study, we conduct a preliminary assessment of HPC system performance impact with STT-MRAM main memory with recent industry estimations. Reliable timing parameters of STT-MRAM devices are unavailable, so we also perform a sensitivity analysis that correlates overall system slowdown trend with respect to average device latency. Our results demonstrate that the overall system performance of large HPC clusters is not particularly sensitive to main-memory latency. Therefore, STT-MRAM, as well as any other emerging non-volatile memories with comparable density and access time, can be a viable option for future HPC memory system design.This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under ExaNoDe project (grant agreement No 671578).Peer ReviewedPostprint (author's final draft

    Main memory latency simulation: the missing link

    Get PDF
    The community accepted the need for a detailed simulation of main memory. Currently, the CPU simulators are usually coupled with the cycle-accurate main memory simulators. However, coupling CPU and memory simulators is not a straight-forward task because some pieces of the circuitry between the last level cache and the memory DIMMs could be easily overlooked and therefore not accounted for. In this paper, we take an approach to quantify the missing cycles in the main memory simulation. To that end, we execute a memory intensive microbenchmark to validate a simulation infrastructure based on ZSim and DRAMsim2 modeling an Intel Sandy Bridge E5-2670 system. We execute the same microbenchmark on a real Sandy Bridge E5-2670 machine identifying a missing 20 ns in the simulator measurements. This is a huge difference that, in the system under study, corresponds to one-third of the overall main memory latency. We propose multiple schemes to add an extra delay in the simulation model to account for the missing cycles. Furthermore, we validate the proposals using the SPEC CPU2006 benchmarks. Finally, we repeat the main memory latency measurements on seven mainstream and emerging computing platforms. Our results show that latency between the Last Level Cache (LLC) and the main memory ranges between tens and hundreds of nanoseconds, so we emphasize on properly adjust and validate these parameters in system simulators before any measurements are performed. Overall, we believe this study would improve main memory simulation leading to the better overall system analysis and explorations performed in the computer architecture community.This work was supported by the Collaboration Agreement between Samsung Electronics Co. Ltd. and BSC, Spanish Ministry of Science and Technology (project TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the Severo Ochoa Programme (SEV-2015-0493) of the Spanish Government.Peer ReviewedPostprint (author's final draft

    DRAM errors in the field: a statistical approach

    Get PDF
    This paper summarizes our two-year study of corrected and uncor-rected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors and it compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies. Our work has two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely-used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can provide volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. The methods that we present are simple to understand and implement, reliable and widely accepted in the statistical community. Overall, our study alerts the community about the need to, firstly, question the current practice in quantifying DRAM reliability and, secondly, to select a proper analysis approach for future studies. Our strong recommendations are to focus on metrics with a practical value that could be easily related to system reliability, and to select methods that provide stable results, ideally supported with statistical significance.This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under EuroEXA project (grant agreement No 754337). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain.Postprint (author's final draft

    Overhead of the spin-lock loop in UltraSPARC T2

    Get PDF
    Spin locks are task synchronization mechanism used to provide mutual exclusion to shared software resources. Spin locks have a good performance in several situations over other synchronization mechanisms, i.e., when on average tasks wait short time to obtain the lock, the probability of getting the lock is high, or when there is no other synchronization mechanism. In this paper we study the effect that the execution of spinlocks create in multithreaded processors. Besides going to multicore architectures, recent industry trends show a big move toward hardware multithreaded processors. Intel P4, IBM POWER5 and POWER6, Sun's UltraSPARC T1 and T2 all this processors implement multithreading in various degrees. By sharing more processor resources we can increase system's performance, but at the same time, it increases the impact that processes executing simultaneously introduce to each other.Postprint (published version
    corecore